Unsupervised Approaches to Text Correction Using Google N-grams for English and Romanian

Authors

  • Diana INKPEN
  • Aminul ISLAM
Abstract

We present an unsupervised approach that can be applied to text correction tasks such as real-word error correction, near-synonym choice, and preposition choice, using n-grams from the Google Web 1T dataset. We describe in detail the method for correcting preposition errors, which has two phases. We categorize the n-gram types based on the position of the gap that needs to be replaced with a preposition. We also consider the normalized frequency values of the candidate prepositions. On the same test set, our experimental results are better than those reported in related work. We applied our method to English, but it can be easily applied to Romanian. Google released n-gram counts for 10 European languages, including Romanian. As a test set, a part of a Romanian corpus can be used. A subset of prepositions needs to be chosen. If we include prepositions that consist of two words, the algorithm needs to be adjusted.
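The abstract does not spell out the two phases, so the following is only a minimal sketch of the general idea it names: group the n-grams that cover the gap by the gap's position, normalize the candidate prepositions' frequencies within each group, and pick the best-scoring candidate. The counts in `TOY_COUNTS`, the candidate list, the function names, and the choice to sum the normalized frequencies are all assumptions made for illustration; they are not taken from the paper or from the Google Web 1T data.

```python
from collections import defaultdict

# Invented trigram counts standing in for a real n-gram dataset (assumption).
TOY_COUNTS = {
    ("it", "depends", "on"): 500,
    ("it", "depends", "of"): 20,
    ("depends", "on", "the"): 900,
    ("depends", "of", "the"): 30,
    ("depends", "in", "the"): 10,
    ("on", "the", "weather"): 400,
    ("in", "the", "weather"): 100,
    ("of", "the", "weather"): 100,
}

# Hypothetical subset of candidate prepositions (single-word only).
PREPOSITIONS = ["on", "of", "in"]


def windows(tokens, gap, n=3):
    """Yield (gap position inside the n-gram, window start) for each
    n-gram window of the sentence that covers the gap index."""
    for start in range(gap - n + 1, gap + 1):
        if start >= 0 and start + n <= len(tokens):
            yield gap - start, start


def choose_preposition(tokens, gap, counts=TOY_COUNTS,
                       candidates=PREPOSITIONS, n=3):
    """Return the candidate with the highest summed normalized frequency,
    or None when no covering n-gram has any count."""
    scores = defaultdict(float)
    for pos, start in windows(tokens, gap, n):
        # Raw frequency of each candidate placed at this gap position.
        freq = {}
        for prep in candidates:
            ngram = list(tokens[start:start + n])
            ngram[pos] = prep
            freq[prep] = counts.get(tuple(ngram), 0)
        total = sum(freq.values())
        if total == 0:
            continue  # no evidence from this gap position
        # Normalize so each gap position contributes on the same scale.
        for prep in candidates:
            scores[prep] += freq[prep] / total
    if not scores:
        return None
    return max(candidates, key=lambda prep: scores[prep])


sentence = ["it", "depends", "_", "the", "weather"]
best = choose_preposition(sentence, gap=2)
```

Normalizing per gap position keeps one very frequent n-gram from dominating the decision. Two-word prepositions would need windows with a two-token gap, which is the adjustment the abstract mentions for Romanian.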

Similar resources

TrWP: Text Relatedness using Word and Phrase Relatedness

Text is composed of words and phrases. In the bag-of-words model, phrases in texts are split into words. This may discard the inner semantics of phrases, which in turn may give inconsistent relatedness scores between two texts. TrWP, the unsupervised text relatedness approach, combines both word and phrase relatedness. The word relatedness is computed using an existing unsupervised co-occurrence based...

Text Similarity Using Google Tri-grams

The purpose of this paper is to propose an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data s...

Representation of textual documents by the approach wordnet and n-grams for the unsupervised classification (clustering) with 2D cellular automata: a comparative study

In this article we present a 2D cellular automaton (Class_AC) to solve a text mining problem in the case of unsupervised classification (clustering). Before experimenting with the cellular automaton, we vectorized our data by indexing textual documents from the REUTERS 21,578 database using the WordNet approach and the n-gram representation of text documents. Our work is to make a comparativ...

University_Of_Sheffield: Two Approaches to Semantic Text Similarity

This paper describes the University of Sheffield’s submission to SemEval-2012 Task 6: Semantic Text Similarity. Two approaches were developed. The first is an unsupervised technique based on the widely used vector space model and information from WordNet. The second method relies on supervised machine learning and represents each sentence as a set of n-grams. This approach also makes use of inf...

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: the current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features were apportioned to each group of images separately. In this regard, each group image surr...

Publication date: 2010